Method, apparatus, and computer program product for locating data in large datasets

ABSTRACT

To analyze a data set having a one-to-many relation, the number of simultaneous occurrences of data in which two data elements are coexistent is obtained for all combinations of two data elements. A dependence ratio of one data element upon the other data element is calculated from the numbers of simultaneous occurrences. The data elements are grouped based upon the numbers of occurrences of individual data elements and the dependence ratios compared with the predetermined thresholds. Based on the number of occurrences of individual data elements and the dependence ratios, subordinate relations of data elements within the same group are specified and displayed to a user in the form of a tree or balloon figure.

FIELD OF THE INVENTION

[0001] The present invention relates to a data analysis system for analyzing data sets that have one-to-many relations among data elements.

BACKGROUND

[0002] Keywords or classification codes are conventionally used to search for specific information contained in an enormous amount of data. For example, in the patent literature, each application includes the application number, the name of the invention, the applicant's name, the inventor's name, and the IPC classification. A specific patent application may be located in a database using the title of the invention or the name of the applicant as keywords, for example, or using the application number or IPC classification. The intended patent application can be found reliably if the keyword or classification code is suitable.

[0003] However, with the conventional method as described above, it is very difficult to obtain information that satisfies multiple criteria. For example, consider the case of identifying the inventors in each technical field from a database of patent publications. In such a case, if the number of subject inventors is very large, or if some of the inventors are active in multiple technical fields, it is difficult to obtain precise information simply by using keywords or classification codes. Also, if the inventors are grouped, and the groups include inventors with low frequency of occurrence, the number of inventors contained in a specific group may be too large. Furthermore, it is almost impossible to determine the relation between inventors in a technical field, or to deduce relationships of primary and secondary contributors to a given field.

[0004] Therefore, when such information is needed, the data is often arranged manually. This is time-consuming, inefficient, and expensive. Manual analysis also leaves considerable room for personal judgment, so the results may differ depending on who performs the analysis.

[0005] The above example is couched in the field of patents, as a descriptive convenience. In recent years, however, it has become increasingly important in general to analyze data drawn from enormous data sets, especially in the field of genome research. The present invention applies as well to such more general problems.

SUMMARY

[0006] A data analysis system according to the present invention for analyzing data with respect to a plurality of data elements or keywords comprises: a database for storing data, an analysis processing portion (means) for analyzing data elements based on the number of occurrences of a first data element and the number of simultaneous occurrences of a second data element with the first data element in the data set to be analyzed, and an output portion (means) for outputting results of the analysis. The number of simultaneous occurrences for all the combinations of data elements is obtained. Simultaneous occurrence means that the first data element and the second data element coexist in a sample or subset of the data.

[0007] Based on the number of simultaneous occurrences, the data elements can be divided into groups. That is, if the ratio of the number of simultaneous occurrences to the number of occurrences of the first data element (hereinafter referred to as a “dependence ratio”) is greater than a predetermined value, the first data element and the second data element by definition belong to the same group of data elements.

[0008] Keywords may be considered in pairs. Two keywords are defined to belong to the same group if the number of data samples that contain both of the two keywords meets a predetermined threshold condition. This threshold may be the ratio of the number of simultaneous occurrences of the two keywords to the number of occurrences of one keyword (hereinafter referred to as a “dependence ratio”).

[0009] In this manner, since relevant data elements are put in the same group only if their dependence ratio is greater than the predetermined value, it is possible to limit the size of the groups by adjusting the threshold appropriately.

[0010] The data analysis system may integrally comprise the analysis processing portion, the database, and the output portion, or the user may gain access to the analysis processing portion via a network such as the Internet or a LAN to receive the analysis results. The analysis processing portion and the output portion may also be separate. In that case, the display terminal which the user employs as the output portion comprises an interface for requesting the data analysis by gaining access to the data analysis apparatus via the network, and accepting means, such as a data communication apparatus, for accepting the data analysis results via the network from the data analysis apparatus.

[0011] That is, the display terminal according to the present invention may use the interface to make a request via the network. The data analysis apparatus that receives the request performs the analysis and forwards the results to the display terminal. In the output portion, accepting means accept the analysis results via the network, and display means display the relation of the plurality of data elements in the form of a figure, based on the analysis results.

[0012] The figure may show hierarchy among the data elements as a tree structure. To create the tree, combinations of the data elements are examined, and the combination with the highest dependence ratio is extracted. In the extracted combination, the subordinate relation between data elements is specified based on the dependence ratio. If the dependence ratio of data element B upon data element A is higher than the dependence ratio of data element A upon data element B, data element B depends upon data element A. That is, data element B is subordinate to data element A. Alternatively, the subordinate relation can be specified based on the number of occurrences of the two data elements. In this case, the data element with the smaller number of occurrences depends upon the data element with the larger number of occurrences.

[0013] The display may represent the data elements as figures such as circles or balloons. The size of the figure may depend upon the number of occurrences of the represented data element, and the distance between the figures may depend upon the relation of the data elements. Herein, the relation of data elements may be a ratio of the number of simultaneous occurrences of two data elements to the number of occurrences of one data element.

[0014] The invention further includes a method for analyzing data, comprising a step of calculating a dependence ratio of one data element upon another data element in data samples to be analyzed, a step of grouping the data elements based on the dependence ratios, and a step of outputting the grouped results. Herein, two data elements belong to the same group, by definition, if their dependence ratio is greater than a predetermined value. Also, a subordinate relation between one data element and another may be specified, based on the ratio of the number of simultaneous occurrences to the number of occurrences of one data element.

[0015] The present invention also includes a program to instruct a computer to specify two relevant keywords, based on the occurrence of keywords in data stored in a database, and group the keywords. That is, if keyword A and keyword B are related, they are defined to belong to the same group. Moreover, if keyword B and keyword C are related, keyword C is defined to belong to the same group as keywords A and B.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] FIG. 1 is a diagram showing a configuration of a data analysis system according to an embodiment of the present invention;

[0017] FIG. 2 is a flowchart showing a process flow of data analysis;

[0018] FIG. 3 is a table showing exemplary data;

[0019] FIG. 4 is a table showing the exemplary data expanded;

[0020] FIG. 5 is a table listing the number of occurrences of each data element;

[0021] FIG. 6 is a table listing the number of simultaneous occurrences of combinations of two data elements;

[0022] FIG. 7 is a table listing a dependence ratio and a determination result for the threshold;

[0023] FIG. 8 is a flowchart showing a grouping process;

[0024] FIG. 9 is a table for use in grouping;

[0025] FIGS. 10A-10C are tables listing grouped results;

[0026] FIG. 11 is a flowchart showing a process for assigning an ungrouped data element to a group;

[0027] FIGS. 12A-12H are tables showing the examination for specifying the group upon which the ungrouped data element depends;

[0028] FIG. 13 is a flowchart showing a process for specifying a correlation of data elements for each group;

[0029] FIGS. 14A-14G are tables showing stages of the examination made of a group;

[0030] FIG. 15 is a table showing results in the form of a tree;

[0031] FIGS. 16A-16D are tables showing stages of the examination of a group;

[0032] FIG. 17 shows analysis results in the form of a tree;

[0033] FIGS. 18A-18C are tables showing stages of an examination of a group;

[0034] FIG. 19 shows analysis results in the form of a tree;

[0035] FIG. 20 is a table listing the group upon which each data sample depends;

[0036] FIG. 21 is a table listing results of an examination of FIG. 20;

[0037] FIG. 22 is a table listing analysis results in the form of a tree, with the data elements constituting the data sample;

[0038] FIGS. 23A-23B are tables showing an examination of a relation between groups in order to output the analysis results in another form;

[0039] FIG. 24 is a table listing the number of simultaneous occurrences for a combination of two groups;

[0040] FIGS. 25A-25B are tables showing the results of examining the number of simultaneous occurrences for the combination of two groups;

[0041] FIG. 26 is a table listing a state where relevant groups are taken out and the dependence ratio is calculated;

[0042] FIG. 27 shows the analysis results of the relation between groups in the form of a tree;

[0043] FIG. 28 shows analysis results in the form of a balloon figure;

[0044] FIGS. 29A-29C are tables listing the parameters required in producing a balloon figure;

[0045] FIG. 30 shows the results of examining the dependence ratio upon an indirectly relevant group;

[0046] FIG. 31 shows the examined results of FIG. 30 in the form of a balloon figure; and

[0047] FIG. 32 is a diagram showing an example having several kinds of data elements.

DETAILED DESCRIPTION

[0048] The present invention will be described below with reference to the accompanying drawings.

[0049] FIG. 1 shows a configuration of a data analysis system according to an embodiment of the invention. As shown in FIG. 1, the data analysis system comprises a database 10 for storing the data, an interface 20, an analysis processing portion (analysis processing means) 30 for analyzing the data stored in the database 10, and an output portion 40 such as a display or a printer.

[0050] In this data analysis system, a user requests data analysis using the interface 20. The analysis processing portion 30 accepts the request, analyzes the data stored in the database 10, and forwards the analysis results to the output portion 40, which then outputs the analysis results to the user in the form of a display or printout.

[0051] The analysis processing portion 30 may be implemented by a preinstalled program and a CPU for executing the program. An analytic method performed by the analysis processing portion 30 will be described below in sequence with reference to the flowchart shown in FIG. 2.

[0052] FIG. 3 illustrates an example of data 300 stored in the database 10. As shown in FIG. 3, the data 300 comprises data samples, each having a sample number. For each sample number (“10001” to “10031” in the figure) appended to each data sample, the related data elements or keywords (A to L) are listed in the data fields 1 to 4 (“data 1”, “data 2”, “data 3” and “data 4”) in one-to-many form.

[0053] As a specific example, if the data is related to patent publications, the sample number may be the application number, and the data element may be the inventor. If the data is related to genome research, the sample number may be, for example, the sample provider, and the data elements may be diseases of the sample provider or genetic features of the sample provider.

[0054] When the user starts the data analysis, the analysis processing portion 30 first reads the data 300 from the database 10, and expands it to one-to-one form. At this time, the data 300 as shown in FIG. 3 is normalized and expanded one-to-one as shown in FIG. 4 (step S101 of FIG. 2).

[0055] For example, in FIG. 3, the data sample of sample number “10001” is associated with the data elements A and C. This is transformed into the data structure 400 in which sample number “10001” and data element A, and sample number “10001” and data element C, are associated individually as shown in FIG. 4.
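The expansion of step S101 can be illustrated with a minimal Python sketch. Only the association of sample “10001” with data elements A and C is taken from the description; the other records, the dictionary layout, and the function name expand are assumptions made for illustration.

```python
# Hypothetical one-to-many records: sample number -> list of data elements.
# Only sample "10001" (elements A and C) comes from the description;
# the remaining rows are invented for this sketch.
samples = {
    "10001": ["A", "C"],
    "10002": ["A", "B"],
    "10003": ["B"],
}


def expand(records):
    """Return one (sample_number, data_element) pair per association."""
    return [
        (number, element)
        for number, elements in records.items()
        for element in elements
    ]


pairs = expand(samples)
for number, element in pairs:
    print(number, element)
```

Each resulting pair corresponds to one row of a one-to-one table of the kind FIG. 4 is described as showing.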

[0056] Subsequently, the analysis processing portion 30 calculates the number of occurrences of each data element (step S102). For example, data element A occurs 14 times in sample numbers “10001” to “10031.” FIG. 5 shows an occurrence table 500 representing the number of occurrences of each data element.

[0057] The analysis processing portion 30 pairs the data elements appearing in the table 300 of FIG. 3, and calculates the number of data samples in which each pair occurs simultaneously (step S103). FIG. 6 shows a simultaneous occurrence table 600 indicating the number of simultaneous occurrences for each combination of two data elements. For example, the data elements A and B occur simultaneously in seven data samples; consequently, the number of simultaneous occurrences is seven in the table 600 of FIG. 6.
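Steps S102 and S103 amount to two counting passes over the data samples. The following Python sketch shows one way to perform them; the sample data and the function names count_occurrences and count_cooccurrences are hypothetical and do not reproduce the tables of FIGS. 3, 5, and 6.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data samples (not the data of FIG. 3).
samples = {
    "s1": ["A", "B"],
    "s2": ["A", "B", "C"],
    "s3": ["A"],
    "s4": ["B", "C"],
}


def count_occurrences(records):
    """Number of data samples in which each data element occurs (step S102)."""
    counts = Counter()
    for elements in records.values():
        counts.update(set(elements))  # count an element once per sample
    return counts


def count_cooccurrences(records):
    """Number of data samples containing both elements of each pair (step S103)."""
    counts = Counter()
    for elements in records.values():
        # sorted() gives every pair a canonical (alphabetical) key
        for first, second in combinations(sorted(set(elements)), 2):
            counts[(first, second)] += 1
    return counts


occurrences = count_occurrences(samples)
cooccurrences = count_cooccurrences(samples)
print(occurrences["A"], cooccurrences[("A", "B")])
```

Keying each pair alphabetically is a design choice of this sketch: the simultaneous occurrence count is symmetric, so one entry per unordered pair suffices.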

[0058] Then, the dependence ratio of the data elements is calculated based on the number of occurrences of each data element (table 500, FIG. 5) and the number of simultaneous occurrences for each combination of data elements (table 600, FIG. 6) (step S104). Herein, the dependence ratio expresses the degree to which one data element depends on another data element; in other words, it is the ratio of the number of simultaneous occurrences of a first data element and a second data element to the number of occurrences of the first data element.

[0059] For example, the ratio of the number of simultaneous occurrences wherein A and B are joint inventors, to the total number of occurrences of inventor A is calculated. This is the dependence ratio by which inventor A depends on inventor B.

[0060] FIG. 7 shows a table 700 of the dependence ratios between data elements, calculated as described above. For example, in calculating the dependence ratio of data element A upon data element B in the tables 500 and 600 of FIGS. 5 and 6, because the number of occurrences of data element A is 14 and the number of simultaneous occurrences of data elements A and B is 7, the dependence ratio of data element A as the main data upon data element B as the subordinate data is equal to 7÷14=0.50.
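The calculation of step S104 is a single division, and it is not symmetric in the two data elements. The short Python sketch below uses only the counts quoted in the description (14 occurrences of A, 9 occurrences of B, 7 simultaneous occurrences); the function name dependence_ratio is an assumption.

```python
def dependence_ratio(cooccurrences, main_occurrences):
    """Ratio of simultaneous occurrences to occurrences of the main data."""
    if main_occurrences <= 0:
        raise ValueError("the main data element must occur at least once")
    return cooccurrences / main_occurrences


# Counts quoted in the description for data elements A and B.
occurrences = {"A": 14, "B": 9}
simultaneous_a_b = 7

a_upon_b = dependence_ratio(simultaneous_a_b, occurrences["A"])
b_upon_a = dependence_ratio(simultaneous_a_b, occurrences["B"])
print(f"{a_upon_b:.2f} {b_upon_a:.2f}")
```

The two directions differ because the denominators differ: A depends on B with ratio 0.50, while B depends on A with ratio 7÷9, about 0.78.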

[0061] Further, a check is made to determine whether the number of occurrences of the data element as the main data, the number of simultaneous occurrences of the main data and subordinate data, and the calculated dependence ratio satisfy the respective predetermined threshold conditions (step S105).

[0062] For the number of occurrences of the data element as main data, the threshold value is set to 3 in this embodiment. A check is made to determine whether the number of occurrences of the data element as main data is greater than this threshold value. If the condition is satisfied, the flag is set to “1”, and otherwise to “0”. The purpose of this threshold value is to exclude data samples with small numbers of occurrences from the subsequent processing and to illustrate trends in the data without being caught up in the details.

[0063] For the number of simultaneous occurrences of the main data and the subordinate data, the threshold value is set to “1” in this embodiment. A check is made to determine whether the number of simultaneous occurrences is greater than the threshold value. If the condition is satisfied, the flag is set to “1”, and otherwise to “0”. The purpose of this threshold is to exclude data samples with small numbers of simultaneous occurrences, even if the dependence ratio satisfies the threshold, as will be described later.

[0064] For the dependence ratio, the threshold is set to 0.60 in this embodiment. A check is made to determine whether the dependence ratio of main data upon subordinate data is greater than the threshold value. If the condition is satisfied, the flag is set to “1”, and otherwise to “0”. The purpose of this threshold is to employ data with high dependence ratios, namely data elements of high relevance, in the subsequent processing.

[0065] Based on the three flags, the conditional flag is set to “1” for any combination of data elements in which all the threshold conditions are satisfied, namely, all three flags are “1”; the conditional flag is set to “0” for other combinations.

[0066] In FIG. 7, in a combination of data element A as the main data and data element B as the subordinate data, the number of occurrences of data element A as the main data is 14, which is greater than the first threshold 3. Consequently, the first flag is set to “1”. Also, because the number of simultaneous occurrences of data element A as the main data and data element B as the subordinate data is 7, which is greater than the second threshold 1, the second flag is set to “1”. Because the dependence ratio of data element A as the main data upon data element B is 0.50, which is less than the third threshold 0.60, the condition is not satisfied, and the third flag remains “0”. The flag is “0” for one of the three conditions; consequently, the conditional flag remains “0”.

[0067] Conversely, in a combination of data element B as the main data and data element A as the subordinate data, the number of occurrences of data element B as the main data is 9, which is greater than the first threshold 3. Consequently, the first flag is set to “1”. Also, because the number of simultaneous occurrences of data element B as the main data and data element A as the subordinate data is 7, which is greater than the second threshold 1, the second flag is set to “1”. Because the dependence ratio of data element B as the main data upon data element A is 0.78, which is greater than the third threshold 0.60, the third condition is satisfied, and the third flag is set to “1”. Thus all three of the flags are equal to “1”; consequently, the conditional flag is set to “1”.

[0068] In this embodiment, when the three flags satisfy the AND condition, the conditional flag is set to “1”, but the invention is not limited thereto.
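The threshold check of step S105 can be sketched in Python as follows. The three threshold values (3, 1, and 0.60), the strict “greater than” comparisons, and the AND combination are those stated for this embodiment; the constant and function names are assumptions.

```python
# Threshold values stated for this embodiment.
OCCURRENCE_THRESHOLD = 3      # occurrences of the main data
COOCCURRENCE_THRESHOLD = 1    # simultaneous occurrences of main and subordinate data
RATIO_THRESHOLD = 0.60        # dependence ratio of main data upon subordinate data


def threshold_flags(main_occurrences, cooccurrences):
    """Return the three flags and the conditional flag (their AND)."""
    if main_occurrences <= 0:
        raise ValueError("the main data element must occur at least once")
    ratio = cooccurrences / main_occurrences
    flags = (
        int(main_occurrences > OCCURRENCE_THRESHOLD),
        int(cooccurrences > COOCCURRENCE_THRESHOLD),
        int(ratio > RATIO_THRESHOLD),
    )
    return flags, int(all(flags))


# The two combinations discussed for FIG. 7.
a_main = threshold_flags(main_occurrences=14, cooccurrences=7)  # A main, B subordinate
b_main = threshold_flags(main_occurrences=9, cooccurrences=7)   # B main, A subordinate
print(a_main, b_main)
```

With these inputs the combination with A as main data fails only the dependence ratio test, while the combination with B as main data passes all three tests.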

[0069] Based on the conditional flag set up in the manner described above, the analysis processing portion 30 groups the data elements (step S106). FIG. 8 is a flowchart showing the grouping process. FIG. 9 is an example of a grouping table 900 listing the results of grouping in accordance with the flowchart of FIG. 8. In the table 900, “Y” indicates a combination for which the conditional flag is set to “1”.

[0070] The flag for the number of occurrences of data element as the main data is checked, and if the flag is “0”, the data element is excluded from the examination (step S201). For example, in the table 700 of FIG. 7, the data elements E, F, J, K and L have the flag “0”, and are therefore excluded from the examination. In this example, the data elements A, B, C, D, G, H and I, other than the data elements E, F, J, K and L that are excluded from the examination, are examined successively in a predetermined order, such as in alphabetical order.

[0071] Subsequently, combinations of each of the data elements to be examined with the other data elements are searched successively. If the dependence ratio is greater than the threshold, the data element that is the partner in the combination is defined as being in the same group.

[0072] Initially, the first data element (data element A in the example of FIG. 9) is made the examination subject (step S202), and set as a new group (first group “Gr1”) (step S203).

[0073] Subsequently, the combinations having the first data element (data element A) as subordinate data are searched (step S204). When a combination with the conditional flag “1” is found, the subject data element as subordinate data and the data element as main data occurring simultaneously therewith are defined as the same group (steps S205 and S206). In the example of FIG. 9, the conditional flag “1” is set in the combination having data element B as main data. Thus, data element B is assigned to the group “Gr1”.

[0074] Thereafter, returning to step S204, the combinations having the subject data element as the subordinate data are searched consecutively until no further combination with the conditional flag “1” is found. In the example of FIG. 9, there is no further combination in which the dependence ratio of data element A is greater than the threshold, so the search of combinations with data element A as subordinate data ends.

[0075] When the search of combinations having the subject data element as the subordinate data ends, the next data element is made the examination subject (steps S207 and S208), and the combinations having the new subject data element as the subordinate data are searched, with the processing from step S204 to step S207 being repeated.

[0076] In the example of FIG. 9, when the combinations having data element B, the new examination subject, as the subordinate data are searched, no combination is found with the conditional flag “1”.

[0077] Further, the combination with data element C as the subordinate data and data element B as the main data has a dependence ratio greater than the threshold value. Consequently, since data element B as the main data is already assigned to the group “Gr1”, data element C is assigned to group “Gr1”.

[0078] The search is continued while the examination subject is changed. The combination of data element C with data element D as the main data has a dependence ratio greater than the threshold. Since data element C is already assigned to the group “Gr1”, data element D is also assigned to group “Gr1”.

[0079] Since there is no other combination in which the dependence ratio of data element C is greater than the threshold, searching for data element C is ended.

[0080] Searching for the data elements D, G, H and I is similarly performed, but since there is no combination that includes the data elements A, B, C and D that are assigned to the group “Gr1”, searching the group “Gr1” is completed.

[0081] In this manner, once all the combinations of data elements as the examination subject are examined, a check is made to determine whether any data element remains that is not grouped (step S209). If so, the first data element of the remaining data elements is made the examination subject (step S210), and the examination continues, returning to step S203. In the example of FIG. 9, the data elements A, B, C and D are already assigned to the group “Gr1”, but the data elements G, H and I are not yet assigned to the group. Thus, of the remaining data elements, data element G is next examined as the subject. Herein, at step S203, the subject data element G is assigned to a new group “Gr2”.

[0082] Thereafter, the processing from step S204 to step S208 is repeated. That is, with data element G as the subordinate data, a search is made for combinations in which the dependence ratio is greater than the threshold. The conditional flag “1” is set in the combination for which data element H is the main data. Thus, data element H is assigned to the group “Gr2”.

[0083] When data element G is the subordinate data and data element I is the main data, the dependence ratio is greater than the threshold. Thus, data element I is assigned to group “Gr2”.

[0084] Moreover, with data element H as the subordinate data, a search is made for the combination in which the dependence ratio is greater than the threshold. Since there is no combination with the conditional flag “1”, the combination with data element I as subordinate data is subsequently searched.

[0085] With data element I as the subordinate data and data element G as the main data, the dependence ratio is greater than the threshold. However, since data element G is already assigned to the group “Gr2”, no further processing is required.

[0086] With data element I as the subordinate data and data element H as the main data, the dependence ratio is greater than the threshold. However, since data element H is already assigned to group “Gr2”, no further processing is required.

[0087] In addition, since there is no other combination in which the dependence ratio is greater than the threshold, with data element G as the subordinate data, searching for data element G is ended.

[0088] All the subject data elements have now been assigned to the groups; consequently, the grouping process is ended.

[0089] As a result of this process, as shown in FIG. 9 and FIG. 10A, the data elements A, B, C and D are assigned to the group “Gr1”, and data elements G, H and I are assigned to the group “Gr2.”

[0090] In this manner, the data elements can be grouped, employing the dependence ratio.
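The outcome of the grouping walkthrough can be reproduced with a short Python sketch. It deliberately swaps in a union-find (disjoint-set) structure instead of following the flowchart of FIG. 8 step by step, on the assumption that two data elements end up in the same group whenever they are linked, directly or through intermediate elements, by combinations whose conditional flag is “1”. The flagged (main data, subordinate data) pairs are transcribed from the walkthrough above as read here, and the function name group_elements is hypothetical.

```python
def group_elements(elements, flagged_pairs):
    """Group elements linked by flagged (main, subordinate) combinations.

    Union-find stand-in for the flowchart of FIG. 8; it reproduces the
    group membership, not the flowchart's order of examination.
    """
    parent = {element: element for element in elements}

    def find(element):
        # Follow parent links to the representative, compressing the path.
        while parent[element] != element:
            parent[element] = parent[parent[element]]
            element = parent[element]
        return element

    for main, subordinate in flagged_pairs:
        if main in parent and subordinate in parent:
            parent[find(main)] = find(subordinate)

    members = {}
    for element in elements:
        members.setdefault(find(element), []).append(element)
    # Number the groups in the order in which their first element appears.
    return {f"Gr{i}": group for i, group in enumerate(members.values(), start=1)}


# Elements that pass the occurrence threshold (E, F, J, K and L are excluded).
examined = ["A", "B", "C", "D", "G", "H", "I"]
# (main data, subordinate data) pairs with conditional flag "1", as read
# from the walkthrough of FIG. 9.
flagged = [("B", "A"), ("B", "C"), ("D", "C"),
           ("H", "G"), ("I", "G"), ("G", "I"), ("H", "I")]

groups = group_elements(examined, flagged)
print(groups)
```

Under these assumptions the sketch yields the same two groups as the walkthrough: A, B, C and D in “Gr1”, and G, H and I in “Gr2”.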

[0091] As listed in table 700 of FIG. 7, data element E occurs simultaneously in the combination with data elements A, B, C and D, but because the number of occurrences of data element E is small, i.e., below the threshold, data element E is excluded from the grouping. In this way, data elements with small frequencies of occurrence are ignored at this time. Consequently, it is possible to prevent a huge number of data elements from being classified into the same group.

[0092] Also, in the table 700 of FIG. 7, data element G that is finally assigned to the group “Gr2” occurs simultaneously in the combination with data element A assigned to the group “Gr1.” However, because the dependence ratio is small, i.e., below the threshold, data element G is excluded from the group “Gr1.”

[0093] When the data elements are grouped in the above way, data elements not belonging to any group may remain, because the number of occurrences of data is small or the dependence ratio is small. Thus, a process for specifying the group that those ungrouped data elements belong to is performed (step S107). This process is optional and is performed in accordance with the user's selection. FIG. 11 is a flowchart showing the detailed flow of this process. The process is repeated for each subject data element until no data elements remain to be processed.

[0094] Each of the ungrouped data elements is assigned to a temporary group (step S301).

[0095] FIG. 10A shows a table 1010 that gives the data element and the group to which the data element belongs according to the above grouping process. Data elements A, B, C and D are assigned to the group “Gr1”, data elements G, H and I are assigned to the group “Gr2”, and data elements E, F, J, K and L are not assigned to any group.

[0096] As shown in table 1020 of FIG. 10B, the data elements E, F, J, K and L are assigned to the temporary groups “Gr10003”, “Gr10004”, “Gr10005”, “Gr10006” and “Gr10007”, respectively.

[0097] Subsequently, the group that each individual subject data element belongs to is specified by examining the data samples that include the individual data elements assigned to the temporary groups (step S302).

[0098] First, in the data sample including the subject data element, the ratio (α) of other data elements occurring simultaneously with the subject data element is calculated as:

Ratio (α)=1/(number of data elements included in data sample − number of data elements belonging to the same group as the subject data element).

[0099] When the denominator is equal to zero, the data element is not assigned to any existing group (jump to step S306).

[0100] For example, the table 1210 of FIG. 12A describes the examination of the subject data element E. In the data of sample number “10007”, the ratio of other data elements A, B and C occurring simultaneously with data element E is calculated as: Ratio (α)=1/(4−1)=0.33. Also, in the data sample “10008”, the ratio of data elements C and D occurring simultaneously with the subject data element E is calculated as: Ratio (α)=1/(3−1)=0.50.

[0101] Then, over all the data samples including the subject data element, the dependence ratio of the subject data element upon each of the existing groups is calculated, based on the groups that other data elements occurring simultaneously with the subject data element belong to. Here, the data elements that belong to the same group as the subject data element are excluded.

[0102] For each group, the ratios (α) for the data elements assigned to that group are added, and the resulting sum is divided by the number of subject data samples. For example, in the example of FIG. 12A, the other data elements A, B and C occurring simultaneously with the subject data element E in the data sample “10007” and the data elements C and D in the data sample “10008” are all assigned to the group “Gr1”. Therefore, the ratios (α) for the data elements A, B, C and D assigned to the group “Gr1” are added up, and the sum Σ(α) is calculated as:

Σ(α)=0.33+0.33+0.33+0.50+0.50=2.00

[0103] As shown in table 1220 of FIG. 12B, the number of subject data samples is two, namely data samples “10007” and “10008.” The dependence ratio (β) of the subject data element E upon the group “Gr1” may be calculated as:

β=Σ(α)/(number of data samples)=2.00/2=1.00

[0104] In this manner, for the subject data element, the group with the highest dependence ratio is assigned as the group that the subject data element belongs to (steps S303 and S304).

[0105] In FIG. 10B, the temporary group is replaced with the newly assigned group. At this time, if there is no newly assigned group, the identification number indicating no-group is assigned (step S305). Thereby, the process for assigning the subject data element to the group is completed.

[0106] In the example of FIG. 12A, for the subject data element E, only the dependence ratio (β) upon one group “Gr1” is calculated, and data element E is naturally assigned to this group. The group of data element E is replaced with “Gr1”, as shown in FIG. 10C.
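The calculation of the ratio (α) and the dependence ratio (β) for data element E can be checked with the following Python sketch. The contents of samples “10007” and “10008” and the group labels are those given in the description of FIGS. 10B and 12A; the dictionary layout and the function name group_dependence are assumptions.

```python
from collections import defaultdict

# Data samples containing the subject data element E, as described for FIG. 12A.
samples = {
    "10007": ["A", "B", "C", "E"],
    "10008": ["C", "D", "E"],
}
# Group of each data element: "Gr1" from the grouping step, and the
# temporary group "Gr10003" assigned to E in FIG. 10B.
group_of = {"A": "Gr1", "B": "Gr1", "C": "Gr1", "D": "Gr1", "E": "Gr10003"}


def group_dependence(subject, records, groups):
    """Return the dependence ratio (beta) of the subject upon each other group."""
    containing = [elements for elements in records.values() if subject in elements]
    sums = defaultdict(float)
    for elements in containing:
        # Exclude elements in the same group as the subject (including itself).
        others = [e for e in elements if groups[e] != groups[subject]]
        if not others:
            continue  # zero denominator: no existing group can be derived
        alpha = 1 / len(others)
        for element in others:
            sums[groups[element]] += alpha
    return {group: total / len(containing) for group, total in sums.items()}


beta = group_dependence("E", samples, group_of)
best_group = max(beta, key=beta.get)
print(beta, best_group)
```

Here len(others) equals the denominator of the ratio (α), that is, the number of data elements in the sample minus the number belonging to the subject's own group. For data element E the only candidate is “Gr1”, with β of about 1.00, which matches the worked example.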

[0107] Tables 1230 and 1240 of FIGS. 12C and 12D describe the examination of the subject data element F in the same manner as above. For data element F, the dependence ratio (β) upon the group “Gr1”, calculated over the data samples “10009” and “10010”, is 1.00. Accordingly, the group of data element F is replaced with “Gr1,” as shown in FIG. 10C.

[0108] Table 1250 of FIG. 12E describes the examination of data element J. Data element J is single, without any other effective groups, and is therefore assigned to the group with the identification number “Gr9999,” indicating no-group. The group of data element J is replaced with “Gr9999,” as shown in FIG. 10C.

[0109] Tables 1260 and 1270 of FIGS. 12F and 12G describe the examination of data element K. For data element K, the dependence ratio (β) upon the temporary group “Gr10007,” which is assigned temporarily to data element L in the data sample “10031,” is 1.00. Accordingly, the group of data element K is replaced with “Gr10007”.

[0110] Table 1280 of FIG. 12H describes the examination of data element L. For data element L, there is no effective dependence ratio (β), because the temporary group “Gr10007” assigned to data element K in the data sample “10031” corresponds to the same group as the subject data element. Because the data element is not single, the process ends.

[0111] In this manner, new groups are created for data elements that are assigned neither to an existing group nor to the no-group category, but instead remain in the temporary group (steps S306 and S307). For example, data elements K and L are not assigned to any of the groups “Gr1”, “Gr2” and “Gr9999”, but instead remain in the temporary group “Gr10007”. Hence, the temporary group is replaced with the new group “Gr3”, as shown in FIG. 10C.

[0112] In this manner, the data elements are effectively decomposed into groups of appropriate sizes that are obtained by setting the threshold suitably, particularly when the data set is especially large.

[0113] Next, the data to be output by the output portion 40 is created, based on the grouping process that is performed by the analysis processing portion 30 as described above (step S108). Although many suitable structures may be used, in this embodiment the correlation of data elements is represented in the form of a tree. FIG. 13 is a flowchart showing the process flow for displaying the tree.

[0114] For this purpose, the subordinate relation of data elements assigned to each group is examined. First of all, when the subject data element assigned to the group is the main data and another data element assigned to the same group is the subordinate data, the number of occurrences of the main data, the number of occurrences of the subordinate data, the number of simultaneous occurrences of the main data and the subordinate data, and the dependence ratio of the main data upon the subordinate data are acquired from table 700 of FIG. 7. The data elements are sorted in descending order according to their dependence ratios. Then, if there are multiple combinations with equal dependence ratios, the data elements are sorted in ascending order of priority by the number of occurrences of the subordinate data, and the subordinate data names are sorted in ascending alphabetic order. The combination with the highest dependence ratio is then extracted. The subordinate data in the extracted combination is the data element upon which the main data element is most highly dependent (step S401).
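By way of illustration only, the selection of step S401 might be sketched as follows. The sketch assumes that the dependence ratio of the main data upon the subordinate data is the number of simultaneous occurrences divided by the number of occurrences of the main data, as recited in the claims. The function names, the data element “X” and its counts are hypothetical, and the tie-breaking order (fewer occurrences first, then name) is one possible reading of the paragraph above rather than a confirmed rule.

```python
def dependence_ratio(main, sub, occurrences, cooccurrences):
    """Simultaneous occurrences of (main, sub) divided by occurrences of main."""
    pair = frozenset((main, sub))
    return cooccurrences.get(pair, 0) / occurrences[main]


def most_dependent_upon(main, group, occurrences, cooccurrences):
    """Return the subordinate element that `main` depends upon most highly.

    Assumed ordering: dependence ratio descending, then number of
    occurrences of the subordinate ascending, then name ascending.
    """
    candidates = [e for e in group if e != main]
    if not candidates:
        return None
    ranked = sorted(
        candidates,
        key=lambda sub: (
            -dependence_ratio(main, sub, occurrences, cooccurrences),
            occurrences[sub],
            sub,
        ),
    )
    return ranked[0]


# A (14 occurrences), B (9 occurrences) and their 7 simultaneous
# occurrences follow the description; "X" is a hypothetical element.
occurrences = {"A": 14, "B": 9, "X": 4}
cooccurrences = {frozenset(("A", "B")): 7, frozenset(("A", "X")): 2}

ratio_a_upon_b = dependence_ratio("A", "B", occurrences, cooccurrences)
ratio_b_upon_a = dependence_ratio("B", "A", occurrences, cooccurrences)
best_for_a = most_dependent_upon("A", ["A", "B", "X"], occurrences, cooccurrences)
```

Under these assumptions the ratio of A upon B evaluates to 0.50 and that of B upon A to about 0.78, matching the values quoted for group “Gr1”.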

[0115] Table 1410 of FIG. 14A describes the examination of main data element A of group “Gr1” when other data elements B, C, D, E and F assigned to the same group “Gr1” are the subordinate data. Herein, the dependence ratio 0.50 for the combination when data element B is the subordinate data is listed at the uppermost level.

[0116] Table 1420 of FIG. 14B describes the examination of data element B assigned to the group “Gr1” in the same way as above, in which the combination when data element A is the subordinate data is listed at the uppermost level. Table 1430 of FIG. 14C describes the examination of data element C, in which the combination when data element B is the subordinate data is listed at the uppermost level. Table 1440 of FIG. 14D describes the examination of data element D, in which the combination when data element C is the subordinate data is listed at the uppermost level. Table 1450 of FIG. 14E describes the examination of data element E, in which the combination when data element C is the subordinate data is listed at the uppermost level. Table 1460 of FIG. 14F describes the examination of data element F, in which the combination when data element C is the subordinate data is listed at the uppermost level.

[0117] The combinations of data elements are extracted as shown in table 1470 of FIG. 14G.

[0118] Then, in the extracted combinations, the data elements are ranked from the correlations of data elements. First, the main data and the subordinate data are specified as “data elements in circular relation”, if there is a combination of data in which the main data and the subordinate data are exchanged. That is, the combinations with a first data element as the main data and a second data element as the subordinate data, and the combinations with the first data element as the subordinate data and the second data element as the main data, are all specified (step S402).

[0119] For example, in table 1470 of FIG. 14G, because the combination in which data element A is the main data and data element B is the subordinate data, and the combination in which data element B is the main data and data element A is the subordinate data are present, data elements A and B are data elements in circular relation.

[0120] Herein, one of the two data elements in circular relation is presumed to be located at the uppermost level among the two or more data elements assigned to the same group. For example, when the data elements are inventors' names, inventors at lower levels depend upon inventors at upper levels in successive order, while the uppermost inventor necessarily depends on the lowest inventor.

[0121] Then, of the two data elements in circular relation that are specified, the data element having a greater number of occurrences of data is set to the uppermost level, “level 1”, and the data element having a smaller number of occurrences is set to the lower level, “level 2” (steps S403 and S404). In the example of FIG. 14, since the number of occurrences of data element A is 14, and the number of occurrences of data element B is 9, data element A is set to “level 1”, and data element B is set to “level 2.”

[0122] Thereafter, a search is made for other data elements dependent upon the data elements at “level 1” or “level 2.” If any such data element is found, it is correlated at the next lower level of the data element set at “level 1” or “level 2” (steps S405 to S407).

[0123] When one data element has a plurality of data elements correlated at the lower level, the data elements are sorted in descending order of priority according to dependence ratio, in descending order according to the number of occurrences, and alphabetically by data element name, and then associated. This process is repeated until there is no corresponding data element (steps S408 and S409).

[0124] In the example of FIG. 14, a search for data elements other than data element B that are dependent upon data element A at “level 1” finds none. Hence, a search is made for data elements dependent upon data element B at the lower “level 2”. Then, since data element C is dependent upon data element B as the subordinate data, data element C is set to “level 3,” which is lower by one than the level of data element B.

[0125] Then, if the data element dependent upon data element C is searched, the data elements D, E and F are extracted, and set to “level 4,” which is lower by one than the level of data element C. Each of the data elements D, E and F has a dependence ratio of 1.00 upon data element C. The number of occurrences of data element D is 3, whereas the numbers of occurrences of data elements E and F are 2. Therefore, data element D has the highest order of priority among D, E, and F. Also, since data elements E and F each occur twice, the priorities of data elements E and F are set according to alphabetical order of their data element names.
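The level assignment of steps S402 to S409 may be sketched, under stated assumptions, as a breadth-first walk starting from the pair of data elements in circular relation. The function name `assign_levels` and the data layout are hypothetical. The occurrence count of data element C and its dependence ratio upon data element B are not given in this passage, so the values used for them below are placeholders; they do not affect the result because C has no sibling at its level.

```python
from collections import deque


def assign_levels(extracted, occurrences, ratios):
    """Assign tree levels from extracted (main -> subordinate) combinations.

    extracted:   dict mapping each main element to the subordinate element
                 it depends upon most highly
    occurrences: dict mapping element -> number of occurrences
    ratios:      dict mapping main element -> its dependence ratio upon
                 its extracted subordinate
    """
    # Step S402: find a pair whose main/subordinate roles are exchanged.
    circular = [(m, s) for m, s in extracted.items()
                if extracted.get(s) == m and m < s]
    if not circular:
        return {}
    a, b = circular[0]

    # Steps S403-S404: the element with more occurrences is level 1.
    top, second = (a, b) if occurrences[a] >= occurrences[b] else (b, a)
    levels = {top: 1, second: 2}

    # Steps S405-S409: attach dependents one level below their parent.
    queue = deque([top, second])
    while queue:
        parent = queue.popleft()
        children = [m for m, s in extracted.items()
                    if s == parent and m not in levels]
        # Ratio descending, occurrences descending, then name.
        children.sort(key=lambda c: (-ratios[c], -occurrences[c], c))
        for child in children:
            levels[child] = levels[parent] + 1
            queue.append(child)
    return levels


extracted = {"A": "B", "B": "A", "C": "B", "D": "C", "E": "C", "F": "C"}
occurrences = {"A": 14, "B": 9, "C": 12, "D": 3, "E": 2, "F": 2}  # C: placeholder
ratios = {"A": 0.50, "B": 0.78, "C": 0.50,  # C: placeholder
          "D": 1.00, "E": 1.00, "F": 1.00}

levels = assign_levels(extracted, occurrences, ratios)
```

With these inputs, A is placed at level 1, B at level 2, C at level 3, and D, E and F at level 4 in that order of priority, as in the example of FIG. 14.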

[0126] After the data elements are sorted, an association diagram Zd 1500 is created in the form of a tree, as shown in FIG. 15. In the association diagram Zd, which is output by the output portion 40, data elements are represented at predetermined intervals in one direction at each of “level 1”, “level 2”, and so on. Furthermore, the data elements at two upper and lower levels, such as the data elements of “level 1” and “level 2” and the data elements of “level 2” and “level 3”, are tied by the link line 1501, where the node 1502 represents correlation.

[0127] FIG. 15 shows an association diagram Zd 1500 of the data elements A, B, C, D, E and F assigned to the group “Gr1.” Herein, the assigned level, the number of occurrences of data, and the dependence ratio may be displayed with each of the data elements.

[0128] Similarly, tables 1610, 1620, and 1630 of FIGS. 16A to 16C show the combinations of the data elements G, H and I assigned to the group “Gr2.” The combination with data element G as the main data and data element I as the subordinate data, the combination with data element H as the main data and data element G as the subordinate data, and the combination with data element I as the main data and data element G as the subordinate data are extracted as those having the highest dependence ratio, as shown in table 1640 of FIG. 16D.

[0129] Data elements G, H and I are classified into levels from the extracted combinations. An association diagram Zd 1700 representing the correlation of the data elements G, H and I is produced, as shown in FIG. 17.

[0130] Similarly, tables 1810 and 1820 of FIGS. 18A and 18B show the combinations of the data elements K and L assigned to the group “Gr3.” The combination with data element K as the main data and data element L as the subordinate data, and the combination with data element L as the main data and data element K as the subordinate data, are extracted as those having the highest dependence ratio, as shown in table 1830 of FIG. 18C. Data elements K and L are classified into levels from the extracted combinations. An association diagram Zd 1900 representing the correlation of the data elements K and L is produced, as shown in FIG. 19.

[0131] The output portion 40 may output the analysis results to the user in another form, based on the grouping process performed by the analysis processing portion 30 in the manner described above. For example, the analysis processing portion 30 may employ the data sample as a key, rather than the data element as a key. For this purpose, the dependence ratio of each data sample upon individual groups is calculated, based on the groups to which the data elements belonging to each data sample are assigned, whereby the group to which each data sample belongs is specified. The dependence ratio (γ) of each data sample upon an individual group may be calculated in accordance with the following expression:

Dependence ratio (γ)=(number of data elements dependent upon the group)/(total number of data elements constituting the data sample).

[0132] If the data sample is dependent upon more than one group, the groups are sorted in descending order of priority according to dependence ratios, in ascending order according to the number of occurrences of data elements belonging to the group, and alphabetically by group names. The group at the uppermost level is specified as the group to which the data sample belongs.

[0133] FIG. 20 shows a table 2000 listing the data elements A to L belonging to the data samples “10001” to “10031” of FIG. 3, and showing group information from FIG. 10C. For example, in the data sample “10001”, the data elements A and C belonging to the data sample “10001” are both assigned to the group “Gr1”.

[0134] Accordingly, data sample “10001” depends on the group “Gr1” alone, and naturally, data sample “10001” is specified as belonging to the group “Gr1.”

[0135] In this connection, the dependence ratio (γ) may be calculated as:

Dependence ratio (γ)=2/2=1.00.

[0136] For example, the data sample “10024” has data element A assigned to group “Gr1” and data elements G and I assigned to group “Gr2”. In this case, the data sample “10024” has a dependence ratio (γ) upon the group “Gr1” given by: Dependence ratio (γ)=1/3=0.33, and has a dependence ratio (γ) upon the group “Gr2”, given by: Dependence ratio (γ)=2/3=0.67. Accordingly, since the dependence ratio (γ) upon the group “Gr2” is highest, the data sample “10024” is specified as belonging to group “Gr2.”
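The calculation of the dependence ratio (γ) for the data samples “10001” and “10024” can be reproduced with a short sketch. The function name `sample_group` is hypothetical. For brevity the sketch breaks ties by group name only and omits the intermediate tie-break on the number of occurrences of data elements belonging to the group; that simplification is an assumption of this example.

```python
from collections import Counter


def sample_group(elements, group_of):
    """Return (group, ratios) for one data sample.

    elements: data elements constituting the data sample
    group_of: dict mapping data element -> assigned group
    ratios:   dependence ratio (gamma) of the sample upon each group
    """
    counts = Counter(group_of[e] for e in elements)
    total = len(elements)
    ratios = {group: n / total for group, n in counts.items()}
    # Highest ratio first; ties resolved by group name (simplified).
    best = min(ratios, key=lambda g: (-ratios[g], g))
    return best, ratios


# Group assignments taken from the description.
group_of = {"A": "Gr1", "C": "Gr1", "G": "Gr2", "I": "Gr2"}

group_10001, ratios_10001 = sample_group(["A", "C"], group_of)
group_10024, ratios_10024 = sample_group(["A", "G", "I"], group_of)
```

Here the data sample “10001” is assigned to “Gr1” with a ratio of 1.00, and the data sample “10024” is assigned to “Gr2” with ratios of about 0.33 and 0.67.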

[0137] FIG. 21 shows a table 2100 listing the groups that the data sample belongs to as specified above. Further, for each group that an individual data sample belongs to, the association diagram Zd 1500 of data elements as shown in FIG. 15, and the relation between individual data samples and data elements may be illustrated, as shown in FIG. 22. For example, in FIG. 22, the data sample to which the data elements A, B, C, D, E and F in the group “Gr1” belong is marked with sign “•”.

[0138] Also, the output portion 40 may output the analysis results to the user in another form, based on the grouping process performed by the analysis processing portion 30 as described above. For example, the analysis processing portion 30 may employ the relation between groups as a key. For this, the number of occurrences of each group is calculated from the number of occurrences of each data element and the group to which each data element is assigned.

[0139] FIG. 23A shows a table 2310 listing the data elements A to L belonging to the data samples “10001” to “10031”, respectively, of FIG. 4, and showing group information for groups “Gr1” to “Gr3” and “Gr9999” from FIG. 10C. For example, the table 2310 indicates that data element A occurs 14 times, and belongs to group “Gr1”.

[0140] Based on this, the number of occurrences of each of the groups “Gr1”, “Gr2”, “Gr3” and “Gr9999” is calculated for the data samples “10001” to “10031”, as shown in table 2320 of FIG. 23B. For example, the number of occurrences of the group “Gr1” is 42.

[0141] Subsequently, to obtain the correlation between groups, the combination of two data elements belonging to different groups is extracted from the combinations of data elements in the data sample, whereby the combination of relevant groups is specified.

[0142] FIG. 24 shows a table 2400 listing the number of simultaneous occurrences of the combinations of two data elements of FIG. 6, and shows group information for groups “Gr1” to “Gr3” and “Gr9999” assigned to the data elements A to L from FIG. 10C. For example, the table 2400 indicates that in the combination of data elements A and B, data element A belongs to group “Gr1” and data element B belongs to group “Gr1,” and that the number of simultaneous occurrences is 7.

[0143] FIG. 25A shows a table 2510 listing the number of simultaneous occurrences for the combinations of two groups, based on FIG. 24, in which the number of simultaneous occurrences for the combination of the group “Gr1” to which the data elements A, B, C, D, E and F belong, and the group “Gr2” to which the data elements G, H and I belong, is equal to 2. However, the table has redundant information regarding membership in the same group twice rather than in two different groups; consequently, the combinations of identical groups are excluded.

[0144] FIG. 25B shows a table 2520 listing the number of simultaneous occurrences for the combinations of two different groups. For example, the number of simultaneous occurrences of the combination of groups “Gr1” and “Gr2” is 2, and other combinations are 0. That is, only the groups “Gr1” and “Gr2” are correlated. For the correlated groups, the dependence ratio between groups is calculated in the same manner as for the data elements, thereby creating data for outputting the association diagram in the form of a tree.

[0145] In the exemplary table 2600 of FIG. 26, for the correlated groups “Gr1” and “Gr2”, when group “Gr1” is the main data and group “Gr2” is the subordinate data, the number of occurrences of group “Gr1” as the main data is 42, and the number of simultaneous occurrences of groups “Gr1” and “Gr2” is 2. Consequently, the dependence ratio (δ) of the main data upon the subordinate data is: Dependence ratio (δ)=2/42=0.05. Similarly, when group “Gr2” is the main data, the dependence ratio (δ) upon the group “Gr1” is 0.11.
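The aggregation of pairwise simultaneous occurrences into group-level counts, and the resulting dependence ratio (δ), may be sketched as follows. The function names are hypothetical. The count of 7 for the pair A and B and the count of 42 for group “Gr1” follow the description; the individual counts for the pairs (A, G) and (A, I) are illustrative assumptions chosen so that the total between “Gr1” and “Gr2” is 2, and are not taken from table 2400.

```python
from collections import Counter


def group_cooccurrences(pair_counts, group_of):
    """Sum element-pair counts per pair of different groups.

    Pairs whose two elements belong to the same group are excluded.
    """
    totals = Counter()
    for (x, y), n in pair_counts.items():
        gx, gy = group_of[x], group_of[y]
        if gx == gy:
            continue  # identical groups carry no inter-group information
        totals[tuple(sorted((gx, gy)))] += n
    return dict(totals)


def group_dependence_ratio(main, sub, group_occurrences, totals):
    """Dependence ratio (delta) of group `main` upon group `sub`."""
    key = tuple(sorted((main, sub)))
    return totals.get(key, 0) / group_occurrences[main]


group_of = {"A": "Gr1", "B": "Gr1", "G": "Gr2", "I": "Gr2"}
pair_counts = {
    ("A", "B"): 7,  # from the description; same group, so excluded
    ("A", "G"): 1,  # illustrative assumption
    ("A", "I"): 1,  # illustrative assumption
}
group_occurrences = {"Gr1": 42}

totals = group_cooccurrences(pair_counts, group_of)
delta = group_dependence_ratio("Gr1", "Gr2", group_occurrences, totals)
```

Under these assumptions the only correlated pair of groups is (“Gr1”, “Gr2”) with 2 simultaneous occurrences, and δ for “Gr1” upon “Gr2” rounds to 0.05.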

[0146] Based on this, data is created for the association diagram Zg 2700 of groups as shown in FIG. 27, in the same manner as the process for creating the association diagram Zd 1500 of data elements shown in FIG. 15. The output portion 40 outputs the association diagram Zg 2700 of groups to the user.

[0147] The association of data elements may also be indicated by using balloon figures as shown in association diagram 2800 of FIG. 28. For each group, the number of occurrences of each data element is proportional to the area of an associated balloon (circle). Other relevant data elements are connected by the links L. The dependence ratio of a data element upon another data element is represented by the distance between connected balloons.

[0148] The square root of the number of occurrences of each data element is calculated, and multiplied by a scaling factor P to obtain a diameter (d) of the balloon, as follows: d=(square root of the number of occurrences of data element)×P. The distance between the centers of the balloons representing the correlated data elements, namely the length S of link L, may be defined as: S=(1/(dependence ratio+M))^(0.5)×(summation of radii of balloons of correlated data elements)×Q. Herein, a small number M is added so that the denominator of the fraction may not be zero when the dependence ratio is zero. Also, the dependence ratio of a lower-level data element upon an upper-level data element is employed.

[0149] FIG. 29A shows a table 2910 listing the diameters of the balloons representing the data elements A, B, C, D, E and F making up the group “Gr1.” For example, for data element A, the number of occurrences is 14, the dependence ratio upon data element B is 0.50, and the diameter d may be calculated as: d=7.0×SQRT(14)=26.2 (mm), where P is equal to 7.0.

[0150] FIG. 29B shows a table 2920 listing the summations of the radii of balloons of correlated data elements, (1/(dependence ratio+M))^(0.5), and the lengths S of the links L between data elements, in which the summation ds of radii of associated data elements A and B is calculated as: ds=26.2/2+21.0/2=23.60.

[0151] Since the dependence ratio of lower-level data element B upon upper-level data element A is 0.78,

(1/(dependence ratio+M))^(0.5)=(1/0.78)^(0.5)=1.13.

[0152] Accordingly, the length S of link L between the data elements A and B may be calculated as S=1.13×23.6×1.0=26.72, where the constant Q is equal to 1.0.
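The balloon geometry for data elements A and B can be checked with a minimal sketch. The function names are hypothetical, and the value chosen for the small number M is an assumption, since the description states only that M is small.

```python
import math


def balloon_diameter(occurrences, p=7.0):
    """Diameter d = sqrt(number of occurrences) x scaling factor P."""
    return math.sqrt(occurrences) * p


def link_length(ratio, d_upper, d_lower, q=1.0, m=1e-6):
    """Length S = (1/(ratio+M))^0.5 x (sum of the two radii) x Q.

    ratio is the dependence ratio of the lower-level element upon the
    upper-level element; m (assumed value) avoids division by zero.
    """
    radii_sum = d_upper / 2.0 + d_lower / 2.0
    return math.sqrt(1.0 / (ratio + m)) * radii_sum * q


d_a = balloon_diameter(14)  # data element A, 14 occurrences
d_b = balloon_diameter(9)   # data element B, 9 occurrences
s_ab = link_length(0.78, d_a, d_b)
```

With P equal to 7.0 and Q equal to 1.0, the diameters evaluate to about 26.2 and 21.0, and the link length to about 26.72, in agreement with the worked example.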

[0153] The balloon figure of FIG. 28 shows the association of the data elements A, B, C, D, E and F making up the group “Gr1,” based on the information of FIGS. 29A and 29B as calculated in the above manner. Herein, the data elements are arranged from highest to lowest level successively. Each of the data elements is represented by a balloon having the diameter d as indicated in table 2910 of FIG. 29A. The length of link L between the center of the balloon representing data element A at “level 1” and the center of the balloon representing data element B at “level 2” is 26.72 mm.

[0154] Herein, the balloon of data element C is linked with the balloon of data element B at upper “level 2,” and the balloons of data elements D, E and F at lower “level 4.”

[0155] When the center of one balloon is intersected by a plurality of links L, the angle made by one link to an adjacent link L is proportional to the number of occurrences of the data element to be linked. For example, the numbers of occurrences of data elements B, D, E and F linked to data element C are 23, 3, 2 and 2, respectively, and the total is 30, as shown in table 2930 of FIG. 29C. Accordingly, an occupancy angle θ of each data element over a total circumference of 360 degrees may be defined as: θ=(number of occurrences/total number of occurrences)×360. For example, the occupancy angle θ of data element B may be calculated as: θ=(23/30)×360=276 degrees.

[0156] If the link L corresponding to each data element is arranged in the center of the occupancy angle θ, a link-to-link angle θm with the adjacent other data element may be defined as: θm=(occupancy angle (θ) of data element+occupancy angle (θ) of adjacent other data element)/2. For example, the link-to-link angle θm with data element E adjacent to the link L of data element B is calculated as: θm=(276+24)/2=150 degrees.
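The two angle formulas may be illustrated with a brief sketch. The function names are hypothetical. The inputs are the count of 23 for data element B and the total of 30 from the example; a count of 2 out of 30 is used for data element E, which corresponds to the 24 degrees appearing in the worked calculation.

```python
def occupancy_angle(count, total):
    """Occupancy angle = (count / total count) x 360 degrees."""
    return count * 360.0 / total


def link_to_link_angle(theta, theta_adjacent):
    """Angle between two adjacent links, each centered in its own
    occupancy angle: the mean of the two occupancy angles."""
    return (theta + theta_adjacent) / 2.0


theta_b = occupancy_angle(23, 30)
theta_e = occupancy_angle(2, 30)
theta_m = link_to_link_angle(theta_b, theta_e)
```

These inputs give 276 degrees for data element B, 24 degrees for data element E, and a link-to-link angle of 150 degrees.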

[0157] Table 2930 of FIG. 29C shows the angles of the links L connecting the data elements B, D, E and F to data element C, calculated in the above manner.

[0158] For the data element having two links L at upper and lower levels, such as data element B, the length S of the link L located between the upper and lower level data elements may be determined, based on the dependence ratio upon the upper level data element and the dependence ratio upon the lower level data element. For example, as shown in table 3000 of FIG. 30, for data element B, the length S of the link L located between the upper level data element A and the lower level data element C is:

Length (S)=(1/(dependence ratio+M))^(0.5)×(summation of radius of balloon of correlated data elements)×Q.

[0159] Herein, since the dependence ratio of the lower level data element C upon the upper level data element A is 0.42: S=39.07 mm.

[0160] FIG. 31 shows the relation between the data elements A and C in the form of a balloon figure 3100. In the above configuration, wherein the data has a one-to-many relation, the data elements can be grouped according to the numbers of occurrences of each data element and the dependence ratios compared with the predetermined thresholds. At this time, since the data elements with small numbers of occurrences and the data elements with lower dependence ratios can be excluded, it is possible to perform the grouping efficiently by limiting the number of data elements being classified into a specific group.

[0161] Further, data elements not belonging to any group can be grouped, based on the group that a simultaneously occurring data element belongs to. Additionally, the subordinate relation of data elements within the same group may be specified, based on the numbers of occurrences of the data elements and the dependence ratios, and displayed to the user as a tree or balloon figure. This allows the user to spatially envision the results of the data analysis.

[0162] Additionally, the data analysis results may be displayed visually in the above manner. By selecting an area of the display that shows the data element or group, the selected data element or group may be specified as a data retrieval condition.

[0163] When no data field is explicitly defined, for example in the case of patents, the data may be sentences, phrases, or chapters in the text of the specification, and the words contained therein may be the data elements. Thus a grouping of words or a display of figures representing the relation between words can be made.

[0164] Further, the similarity of data may be measured by comparing the similarity of tree figures or balloon figures.

[0165] In the above exemplary embodiment, the system includes the database 10, the interface 20, the analysis processing portion 30 and the output portion 40; other embodiments may be configured as integrated systems. For example, the database 10 or the analysis processing portion 30 may communicate with the interface 20 or the output portion 40 via a network such as the Internet or LAN, or the analysis processing portion 30 may be provided on the user's side, along with the interface 20 or the output portion 40, to enable access to the database 10 via the network.

[0166] Data elements are not limited to one kind, for example inventors' names, but may comprise a plurality of kinds of data, for example genetic features and diseases. In this case, when representing data elements in the form of tree figures, the display areas 3210 and 3220 may be divided as shown in FIG. 32.

[0167] A program for enabling a computer to execute the data analysis process as shown in the above embodiment may be provided in the form of the storage medium or the program transmission unit as cited below. That is, the storage medium may be a computer readable storage medium such as CD-ROM, DVD, memory or hard disk to store the program for enabling the computer to execute the data analysis process.

[0168] Also, the program transmission unit may comprise storage means such as CD-ROM, DVD, memory or hard disk to store the above program, and transmission means for transmitting the program via the network such as the Internet or LAN on the side of the apparatus for reading the program from the storage means and executing the program. This program transmission unit is suitable particularly for installing the program for performing the above process in a computing analysis apparatus or the like.

[0169] While the invention has been described in terms of preferred embodiments, it should be understood that numerous modifications may be made thereto without departing from the spirit and scope of the invention as defined by the appended Claims.

I claim:

1. A data analysis system for analyzing data sets, comprising: a database for storing data; analysis processing means for analyzing a relation among data elements based on a number of occurrences, in stored data, of a first data element, and a number of simultaneous occurrences, in the stored data, of the first data element and a second data element; and output means for outputting results provided by the analysis processing means.

2. The data analysis system according to claim 1, wherein the analysis processing means groups data elements based on the number of simultaneous occurrences.

3. The data analysis system according to claim 2, wherein the analysis processing means groups the first data element and the second data element into the same group, when a ratio of the number of simultaneous occurrences to the number of occurrences of the first data element is greater than a predetermined value.

4. The data analysis system according to claim 1, wherein the output means represents dependencies among the data elements as a tree figure.

5. Apparatus for data analysis, comprising: analysis processing means for specifying pairs of keywords based on frequencies that keywords occur in a set of keywords of data stored in a database, and grouping the set of keywords into groups of keywords based on the specified pairs of keywords; and output means for outputting results provided by the analysis processing means.

6. The apparatus for data analysis according to claim 5, wherein the analysis processing means calculates a number of data samples that include both keywords of the pair of keywords, and groups both keywords into the same group if the calculated number of data samples that include both keywords meets a predetermined threshold.

7. The data analysis apparatus according to claim 6, wherein the analysis processing means defines the group to which one of the two keywords belongs, based on the group to which the other of the two keywords belongs, when the predetermined threshold is not met.

8. The data analysis apparatus according to claim 5, wherein the output means displays results provided by the analysis processing means in the form of a figure.

9. A display terminal comprising: an interface for requesting analysis of data; accepting means for accepting analysis results giving a relation based on a number of occurrences of two data elements of a plurality of data elements; and output means for displaying the relation as a figure, based on the analysis results.

10. The display terminal according to claim 9, wherein the output means displays subordinate relations of the data as a tree figure.

11. The display terminal according to claim 9, wherein the output means represents each data element in the form of a figure having a size based on the number of occurrences of the data element, and wherein the distance between two figures is based on the relation of two represented data elements.

12. A data analysis method, comprising the steps of: calculating dependence ratios of a plurality of data elements to be analyzed; grouping the data elements according to the dependence ratios; and outputting the grouped data elements.

13. The data analysis method according to claim 12, wherein the step of grouping includes the step of grouping a first data element and a second data element in the same group, if a ratio of the number of simultaneous occurrences of the first data element and the second data element to the number of occurrences of the first data element is greater than a predetermined value.

14. The data analysis method according to claim 13, wherein the step of grouping includes the step of specifying a subordinate relation between the first data element and the second data element, based on the ratio of the number of simultaneous occurrences to the number of occurrences of the first data element.